class: center, middle, inverse, title-slide .title[ # Lecture 11: Machine Learning ] .subtitle[ ## Tree-Based Machine Learning Methods ] .author[ ### James Sears*
AFRE 891 SS24
Michigan State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Prediction and Machine-Learning in Econometrics”
by Ed Rubin, used under
CC BY-NC-SA 4.0
.] ]

---
exclude: true

---

<style type="text/css">
/* CSS for including pauses in printed PDF output (see bottom of lecture) */
@media print {
  .has-continuation { display: block !important; }
}
.remark-code-line { font-size: 95%; }
.small { font-size: 75%; }
.scroll-output-full { height: 90%; overflow-y: scroll; }
.scroll-output-75 { height: 75%; overflow-y: scroll; }
</style>

# Table of Contents

**Part 3: Tree-Based Methods**

1. [Decision Trees](#trees)
1. [Random Forests](#forests)
1. [Machine Learning for Causal Treatment Effect Estimation](#causal)
1. [Deep Learning (if time - not a tree method)](#deep)

---

# Prologue

Packages we'll use today:

```r
pacman::p_load(grf, fixest, janitor, magrittr, parsnip, ranger, rpart,
               rpart.plot, rsample, scales, tidymodels, tidyverse, vip)
```

As well, let's use two datasets:

* The ISL default data again
* Heart disease data

```r
default_df <- ISLR::Default
heart_df <- read_csv("data/Heart.csv") %>%
  dplyr::select(-1) %>%
  rename(HeartDisease = AHD) %>%
  clean_names()
```

---

# Prologue

Throughout our econometric training, we've become experts in predicting an outcome by writing down a model with a .hi-blue[specific functional form] for the relationship between `\(\color{#7BBD00}{\mathbf{y}}\)` and `\(\color{#6A5ACD}{\mathbf{X}}\)`.

--

.hi-medgrn[Decision Trees] (or classification and regression trees, CART) are a supervised learning method that offers an alternative, .hi-medgrn[data-driven] way of generating predictions by partitioning the covariate space into groups and using the grouping to generate predictions.

--

After working through decision trees, we'll shift our focus to .hi-green[forest methods] that combine many, many trees to overcome drawbacks of individual trees and yield predictions of alternate objects - including heterogeneous treatment effects.
---
class: inverse, middle
name: trees

# 🌲 Decision Trees 🌲

---

# Decision Trees: Punchline

.hi-medgrn[Goal]

- Split the .it[predictor space] (our `\(\color{#6A5ACD}{\mathbf{X}}\)`) into regions (partitions)
- Predict the most-common value within a region

--

.hi-medgrn[Attributes of Decision Trees]

1. Work for .hi-purple[both classification and regression]

--

1. Are inherently .hi-green[nonlinear]

--

1. Are relatively .hi-pink[simple] and .hi-pink[interpretable]

--

1. Often .hi-blue[underperform] relative to competing methods

--

1. Easily extend to .hi-red[very competitive ensemble methods] (forests with *many* trees).super[🌲]

.footnote[ 🌲 Though the ensembles will be much less interpretable. ]

---

# Growing Decision Trees

.hi-blue[1\. Divide the predictor space] into `\(J\)` regions (using predictors `\(\mathbf{x}_1,\ldots,\mathbf{x}_p\)`)

--

* .hi-pink[Regression Trees:] Choose the regions to .hi-green[minimize RSS] across all `\(J\)` regions

$$
`\begin{align}
\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
\end{align}`
$$

---

# Growing Decision Trees: Steps

.hi-blue[1\. Divide the predictor space] into `\(J\)` regions (using predictors `\(\mathbf{x}_1,\ldots,\mathbf{x}_p\)`)

* .hi-pink[Classification Trees:] Choose the regions to .hi-green[minimize Gini index] or .hi-green[Entropy]

--

.pull-left[
$$
`\begin{align}
G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)
\end{align}`
$$
]

--

.pull-right[
$$
`\begin{align}
Entropy = - \sum_{k=1}^{K} \hat{p}_{mk} \log \left( \hat{p}_{mk} \right)
\end{align}`
$$
]

<br>

* Keep splitting until a specified threshold is reached (e.g. at most 5 observations per region)

---

# Classification Tree

.hi-medgrn[Q:] Why are we using the Gini index or entropy (*vs.* error rate)?

--

.hi-medgrn[A:] The error rate isn't sufficiently sensitive to grow good trees. <br> The Gini index and entropy tell us about the .hi-blue[composition] of the leaf.

--

Consider two different leaves in a three-level classification.
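These measures are cheap to compute by hand. A quick sketch (the `gini()` and `entropy()` helper functions here are my own, not part of the lecture code):

```r
# Leaf "purity" measures for a vector of class counts.
# These helper functions are a sketch, not part of the lecture code.
gini <- function(counts) {
  p <- counts / sum(counts)
  sum(p * (1 - p))
}
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0] # treat 0 * log(0) as 0
  -sum(p * log(p))
}

gini(c(51, 49, 0))     # Leaf 1: 0.4998
entropy(c(51, 49, 0))  # Leaf 1: 0.6929
gini(c(51, 25, 24))    # Leaf 2: 0.6198
entropy(c(51, 25, 24)) # Leaf 2: 1.0325
```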
--

.pull-left[
.center.hi-slate[Leaf 1]
- .b[A:] 51, .b[B:] 49, .b[C:] 00
- .hi-orange[Error rate:] 49%
- .hi-purple[Gini index:] 0.4998
- .hi-pink[Entropy:] 0.6929
]

--

.pull-right[
.center.hi-slate[Leaf 2]
- .b[A:] 51, .b[B:] 25, .b[C:] 24
- .hi-orange[Error rate:] 49%
- .hi-purple[Gini index:] 0.6198
- .hi-pink[Entropy:] 1.0325
]

.clear-up[ The .hi-purple[Gini index] and .hi-pink[entropy] tell us about the distribution. ]

---

# Growing Decision Trees: Steps

.hi-blue[2\. Make predictions] using the regions' mean outcome. <br> <br> For region `\(R_j\)` predict `\(\hat{y}_{R_j}\)` where

$$
`\begin{align}
\hat{y}_{R_j} = \frac{1}{n_j} \sum_{i\in R_j} y_i
\end{align}`
$$

---

# Growing Decision Trees

Let's visualize this by growing a classification tree for predicting credit card defaults based on income and card balances.

---

# Growing Decision Trees

Consider our two-dimensional covariate space:

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-raw-1.png" width="88%" style="display: block; margin: auto;" />

---

# Growing Decision Trees

The .hi-pink[first partition] splits balance at $1,800.

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split1-1.png" width="88%" style="display: block; margin: auto;" />

---

# Growing Decision Trees

The .hi-pink[second partition] splits balance at $1,972 (.it[conditional on bal. > $1,800]).

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split2-1.png" width="88%" style="display: block; margin: auto;" />

---

# Growing Decision Trees

The .hi-pink[third partition] splits income at $27K .b[for] bal. between $1,800 and $1,972.

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3-1.png" width="88%" style="display: block; margin: auto;" />

---

# Growing Decision Trees

These three partitions give us four .hi-pink[regions]...
<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3b-1.png" width="88%" style="display: block; margin: auto;" /> --- # Growing Decision Trees .hi-pink[Predictions] cover each region (e.g. using the region's .hi-pink[most common class]). <img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3c-1.png" width="88%" style="display: block; margin: auto;" /> --- # Growing Decision Trees .hi-pink[Predictions] cover each region (e.g. using the region's .hi-pink[most common class]). <img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3d-1.png" width="88%" style="display: block; margin: auto;" /> --- # Growing Decision Trees .hi-pink[Predictions] cover each region (e.g. using the region's .hi-pink[most common class]). <img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3e-1.png" width="88%" style="display: block; margin: auto;" /> --- # Growing Decision Trees .hi-pink[Predictions] cover each region (e.g. using the region's .hi-pink[most common class]). <img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-split3f-1.png" width="88%" style="display: block; margin: auto;" /> --- # Visualizing the Decision Tree .hi-medgrn[Q:] Why do we call it a tree? -- .hi-medgrn[A:] Because we can easily represent this partitioning process in the form of a .hi-pink[decision tree!] 
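We can grow a comparable tree ourselves. A sketch using .hi-slate[rpart] on the default data loaded in the prologue (the object name `default_tree_raw` is mine, and the exact split points depend on `rpart`'s defaults, e.g. the complexity parameter `cp`):

```r
# Grow a classification tree on the ISL default data -- a sketch,
# not the lecture's exact code; splits depend on rpart's defaults
library(rpart)

default_tree_raw <- rpart(
  default ~ balance + income,
  data = ISLR::Default,
  method = "class" # classification tree
)
default_tree_raw # printing lists the splits; balance dominates
```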
---

# Visualizing the Decision Tree

Our first split occurs at a Balance of $1800:

.clear-space[ ]

.center[ <img src = "images/tree1.svg" width = "600" /img> ]

---

# Visualizing the Decision Tree

If Balance > 1800 is False, we get our first region:

.clear-space[ ]

.center[ <img src = "images/tree2.svg" width = "600" /img> ]

---

# Visualizing the Decision Tree

If Balance > 1800 is True, we next split on Balance < 1,972

.clear-space[ ]

.center[ <img src = "images/tree3.svg" width = "600" /img> ]

---

# Visualizing the Decision Tree

If False, we get our second terminal node (76% default)

.clear-space[ ]

.center[ <img src = "images/tree4.svg" width = "600" /img> ]

---

# Visualizing the Decision Tree

If True, we get our final split at Income > 27,000

.clear-space[ ]

.center[ <img src = "images/tree5.svg" width = "600" /img> ]

---

# Decision Tree Anatomy

The .hi-pink[regions] correspond to the tree's .hi-pink[terminal nodes] (or .hi-pink[leaves])

.clear-space[ ]

.center[ <img src = "images/tree6.svg" width = "600" /img> ]

---

# Decision Tree Anatomy

The graph's .hi-pink[separating lines] correspond to the tree's .hi-pink[internal nodes]

.clear-space[ ]

.center[ <img src = "images/tree7.svg" width = "600" /img> ]

---

# Decision Tree Anatomy

The segments connecting the nodes are the tree's .hi-pink[branches]

.clear-space[ ]

.center[ <img src = "images/tree8.svg" width = "600" /img> ]

---

# Growing The Tree

.hi-medgrn[Problem:] Examining every possible partition (every .pink[cutoff] `\(\color{#e64173}{s}\)` of every .purple[predictor] `\(\color{#6A5ACD}{\mathbf{x}_j}\)`) is .hi-medgrn[computationally infeasible.]

--

.hi-blue[Solution:] a top-down, *greedy* algorithm named .hi-blue[recursive binary splitting]

--

* .hi-blue[Recursive:] starts with the "best" split, then finds the next "best" split, etc.
-- * .hi-blue[Binary:] each split creates only two branches: "yes" and "no" -- * .hi-blue[Greedy:] each step makes the "best" split without consideration for the overall process --- # Splitting Example Consider the dataset
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 0 | 0 | 1 | 4 |
| 2 | 0 | 8 | 3 | 2 |
| 3 | 0 | 6 | 5 | 6 |
-- With just three observations, each variable only has two actual splits..super[🌲] .footnote[ 🌲 You can think about cutoffs as the ways we divide observations into two groups. ] --- # Splitting Example One possible split: x.sub[1] at 2, which yields .purple[(.b[1]) x.sub[1] < 2] .it[vs.] .pink[(.b[2]) x.sub[1] ≥ 2]
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 0 | 0 | 1 | 4 |
| 2 | 7 | 8 | 3 | 2 |
| 3 | 7 | 6 | 5 | 6 |
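The RSS arithmetic for this split (and the other three candidates compared over the next several slides) is easy to check directly. A sketch, where the `toy` data frame and the `split_rss()` helper are my own, built from the table above:

```r
# Check each candidate split's RSS -- my own helper, not lecture code
toy <- data.frame(y = c(0, 8, 6), x1 = c(1, 3, 5), x2 = c(4, 2, 6))

split_rss <- function(x, s, y) {
  left <- x < s
  # predict each group's mean, then sum squared residuals
  pred <- ifelse(left, mean(y[left]), mean(y[!left]))
  sum((y - pred)^2)
}

split_rss(toy$x1, 2, toy$y) # x1 at 2: RSS = 2
split_rss(toy$x1, 4, toy$y) # x1 at 4: RSS = 32
split_rss(toy$x2, 3, toy$y) # x2 at 3: RSS = 18
split_rss(toy$x2, 5, toy$y) # x2 at 5: RSS = 32
```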
--- # Splitting Example One possible split: x.sub[1] at 2, which yields .purple[(.b[1]) x.sub[1] < 2] .it[vs.] .pink[(.b[2]) x.sub[1] ≥ 2]
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 0 | 0 | 1 | 4 |
| 2 | 7 | 8 | 3 | 2 |
| 3 | 7 | 6 | 5 | 6 |
This split yields an RSS of .purple[0.super[2]] + .pink[1.super[2]] + .pink[(-1).super[2]] = 2. -- .note[Note.sub[1]] Splitting x.sub[1] at 2 yields the same results as 1.5, 2.5—anything in (1, 3). --- # Splitting Example An alternative split: x.sub[1] at 4, which yields .purple[(.b[1]) x.sub[1] < 4] .it[vs.] .pink[(.b[2]) x.sub[1] ≥ 4]
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 4 | 0 | 1 | 4 |
| 2 | 4 | 8 | 3 | 2 |
| 3 | 6 | 6 | 5 | 6 |
This split yields an RSS of .purple[(-4).super[2]] + .purple[4.super[2]] + .pink[0.super[2]] = 32.

--

.it[Previous:] Splitting x.sub[1] at 2 yielded RSS = 2. .it.grey-light[(Much better)]

---

# Splitting Example

Another split: x.sub[2] at 3, which yields .purple[(.b[1]) x.sub[2] < 3] .it[vs.] .pink[(.b[2]) x.sub[2] ≥ 3]
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 3 | 0 | 1 | 4 |
| 2 | 8 | 8 | 3 | 2 |
| 3 | 3 | 6 | 5 | 6 |
This split yields an RSS of .pink[(-3).super[2]] + .purple[0.super[2]] + .pink[3.super[2]] = 18.

---

# Splitting Example

Final split: x.sub[2] at 5, which yields .purple[(.b[1]) x.sub[2] < 5] .it[vs.] .pink[(.b[2]) x.sub[2] ≥ 5]
| i | pred | y | x1 | x2 |
|---|------|---|----|----|
| 1 | 4 | 0 | 1 | 4 |
| 2 | 4 | 8 | 3 | 2 |
| 3 | 6 | 6 | 5 | 6 |
This split yields an RSS of .purple[(-4).super[2]] + .purple[4.super[2]] + .pink[0.super[2]] = 32.

---

# Splitting Example

Across our four possible splits (two variables each with two splits)

- x.sub[1] with a cutoff of 2: .b[RSS] = 2
- x.sub[1] with a cutoff of 4: .b[RSS] = 32
- x.sub[2] with a cutoff of 3: .b[RSS] = 18
- x.sub[2] with a cutoff of 5: .b[RSS] = 32

our split of x.sub[1] at 2 generates the lowest RSS.

---

# Splitting

.note[Note:] Categorical predictors work in exactly the same way. <br>We want to try .b[all possible combinations] of the categories.

--

.ex[Ex:] For a four-level categorical predictor (levels: A, B, C, D)

.pull-left[
- Split 1: .pink[A|B|C] .it[vs.] .purple[D]
- Split 2: .pink[A|B|D] .it[vs.] .purple[C]
- Split 3: .pink[A|C|D] .it[vs.] .purple[B]
- Split 4: .pink[B|C|D] .it[vs.] .purple[A]
]

.pull-right[
- Split 5: .pink[A|B] .it[vs.] .purple[C|D]
- Split 6: .pink[A|C] .it[vs.] .purple[B|D]
- Split 7: .pink[A|D] .it[vs.] .purple[B|C]
]

.clear-up[ We would need to try 7 possible splits. ]

---

# Splitting

Once we make a split, we then continue splitting, <br>.b[conditional] on the regions from our previous splits.

So if our first split creates R.sub[1] and R.sub[2], then our next split <br>searches the predictor space only in R.sub[1] or R.sub[2]..super[🌲]

.footnote[ 🌲 We are no longer searching the full space—it is conditional on the previous splits. ]

--

The tree continues to .b[grow until] it hits some specified threshold, <br>_e.g._, at most 5 observations in each leaf.

---

# Too many Splits?

One can have .hi-blue[too many splits.] .hi-medgrn[Q:] Why?

--

.hi-medgrn[A:] "More splits" means

--

1. more flexibility (think about the bias-variance tradeoff/overfitting)

--

1. less interpretability (one of the selling points for trees)

--

.hi-medgrn[Q:] So what can we do?

--

.hi-medgrn[A:] Prune your trees!

---

# Pruning

.hi-medgrn[The idea:] Some regions may increase .hi[variance] more than they reduce .hi[bias].
* By removing these regions, we can reduce test MSE.

.hi-medgrn[Candidates for trimming:] Regions that do not .b[reduce RSS] very much.

--

.hi-medgrn[Updated strategy:] Grow a big tree `\(T_0\)` and then trim it to an optimal .attn[subtree].

--

.hi-medgrn[Updated problem:] Considering all possible subtrees can get expensive.

---

# Pruning

.hi-medgrn[Updated solution:] add a penalty for complexity with .hi-medgrn[cost-complexity pruning]

* Pay a penalty to become more complex with a greater number of regions `\(|T|\)`

--

For regression trees.super[1]:

$$
`\begin{align}
\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|
\end{align}`
$$

For any value of `\(\alpha (\ge 0)\)`, we get a subtree `\(T\subset T_0\)`.

--

We choose `\(\alpha\)` via cross validation.

.footnote[ 1 For classification trees we instead penalize the error rate ]

---

# Decision Trees in R

To train decision trees in R, we'll use .hi-slate[tidymodels] (and several others) for modeling and machine learning using .hi-slate[tidyverse] principles.super[2]

.footnote[2\. [tidymodels](https://www.tidymodels.org/) allows you to do a *ton* of stuff that we're just scratching the surface of.]

--

Let's first set the seed and define our cross-validation with `vfold_cv` from .hi-slate[rsample]

* `v` the number of folds (here we'll use 5)

```r
# Define our CV split
set.seed(12345)
default_cv <- default_df %>% vfold_cv(v = 5)
```

---

# Decision Trees in R

Next, let's use the `decision_tree()` function from .hi-slate[parsnip] to set up the parameters of the tree:

- `mode`: `"regression"` or `"classification"`
- `cost_complexity`: the cost (penalty) paid for complexity
- `tree_depth`: *max.* tree depth (max.
number of splits in a "branch")
- `min_n`: *min.* # of observations for a node to split
- Use the .hi-slate[rpart] "engine" to grow the tree

```r
# Define the decision tree
default_tree <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = 10 # Arbitrarily choosing '10'
) %>% set_engine("rpart")
```

---

# Decision Trees in R

Next, use `recipe()` to build a recipe describing the formula/data

```r
# Define recipe
default_recipe <- recipe(default ~ ., data = default_df)
```

--

And define the workflow, first adding the model then the recipe

```r
# Define the workflow
default_flow <- workflow() %>%
  add_model(default_tree) %>%
  add_recipe(default_recipe)
```

---

# Decision Trees in R

Lastly, execute the workflow through cross-validation!

```r
# Tune!
default_cv_fit <- default_flow %>%
  tune_grid(
    default_cv,
    grid = expand_grid(
      cost_complexity = seq(0, 0.15, by = 0.01),
      tree_depth = c(1, 2, 5, 10)
    ),
    metrics = metric_set(accuracy)
  )
```

---

# Decision Trees in R

Plotting

```r
ggplot(
  data = default_cv_fit %>% collect_metrics() %>% filter(.metric == "accuracy"),
  aes(
    x = cost_complexity, y = mean,
    color = tree_depth %>% factor(levels = c(1, 2, 5, 10), ordered = T),
    shape = tree_depth %>% factor(levels = c(1, 2, 5, 10), ordered = T)
  )
) +
  geom_line(linewidth = 0.4) +
  geom_point(size = 3, alpha = 0.8) +
  scale_y_continuous("Accuracy") +
  scale_x_continuous("Cost complexity") +
  scale_color_viridis_d("Tree depth") +
  scale_shape_manual("Tree depth", values = c(1, 18, 15, 20)) +
  theme_minimal(base_size = 16) +
  theme(legend.position = "bottom")
```

---

# Decision Trees in R

<img src="data:image/png;base64,#ML-Trees_files/figure-html/unnamed-chunk-17-1.png" width="120%" style="display: block; margin: auto;" />

---

# Decision Trees in R

.hi-slate[To plot the CV-chosen tree...]

1\.
.hi-pink[Fit] the chosen/best model with `finalize_workflow()` and `select_best()`

```r
best_flow <- default_flow %>%
  finalize_workflow(select_best(default_cv_fit, metric = "accuracy")) %>%
  fit(data = default_df)
```

--

2\. .hi-purple[Extract] the fitted model with `extract_fit_parsnip()`

```r
best_tree <- best_flow %>% extract_fit_parsnip()
```

---

# Decision Trees in R

3\. .hi-red[Plot] the tree with `rpart.plot()` from .hi-slate[rpart.plot].

```r
best_tree$fit %>% rpart.plot()
```

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-rpart-cv-1.png" width="80%" style="display: block; margin: auto;" />

---

# Decision Trees in R

The previous tree has cost complexity of 0.03 (and a max. depth of 5).

We can compare this "best" tree to a less pruned/penalized tree

- `cost_complexity = 0.005`
- `tree_depth = 5`

---

# Decision Trees in R

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-tree_complex-1.png" width="120%" style="display: block; margin: auto;" />

---

# Decision Trees vs. Linear Models

<br><br>

.center[ .hi-medgrn[Q] How do trees compare to linear models?]

--

.center[ .hi-medgrn[A] It depends how linear the true boundary is. ]

---

# Linear Boundary

.pull-left.center[.hi-medgrn[Linear models] recreate the line]
.pull-right.center[.hi-blue[trees] struggle at replicating linearity]

<img src="data:image/png;base64,#images/compare-linear.png" width="80%" style="display: block; margin: auto;" />

.ex.small[Source: ISL, p. 315]

---

# Nonlinear Boundary

.pull-left.center[.hi-medgrn[Linear models] struggle with nonlinear boundaries]
.pull-right.center[.hi-blue[trees] easily replicate these relationships]

<img src="data:image/png;base64,#images/compare-nonlinear.png" width="80%" style="display: block; margin: auto;" />

.ex.small[Source: ISL, p. 315]

---

# Strengths and weaknesses

As with any method, decision trees have .hi-medgrn[tradeoffs].
--

.pull-left.purple.font90[
.hi-purple.center[Strengths]
<br>.b[+] Easily explained/interpreted
<br>.b[+] Include several graphical options
<br>.b[+] Mirror human decision making?
<br>.b[+] Handle num. or cat. on LHS/RHS
]

--

.pull-right.pink.font90[
.hi-pink.center[Weaknesses]
<br>.b[-] Outperformed by other methods
<br>.b[-] Struggle with linearity
<br>.b[-] Can be very "non-robust" - small data changes can cause huge changes in our tree
]

--

.note[Next:] Create ensembles of trees.super[🌲] to strengthen these weaknesses..super[🌴] with...

.footnote[ 🌲 Forests! 🌴 Which will also weaken some of the strengths. ]

---
class: inverse, middle
name: forests

# 🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳 Random Forests 🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳🌲🌴🌳

---

# Random Forests

.hi-medgrn[The gist:] combine many individual trees into a single .hi-medgrn[forest] via .hi-pink[bootstrap aggregation (bagging)]

--

.hi-pink[Bagging:]

1. Create `\(B\)` bootstrapped samples

--

1. Train an estimator (tree) `\(\color{#6A5ACD}{\mathop{\hat{f^b}}(x)}\)` on each of the `\(B\)` samples

--

1. Aggregate across your `\(B\)` bootstrapped models:

$$
`\begin{align}
\color{#e64173}{\mathop{\hat{f}_{\text{bag}}}(x)} = \dfrac{1}{B}\sum_{b=1}^{B}\color{#6A5ACD}{\mathop{\hat{f^b}}(x)}
\end{align}`
$$

This aggregated model `\(\color{#e64173}{\mathop{\hat{f}_{\text{bag}}}(x)}\)` is your final model.

---

# Bagging

When we apply bagging to decision trees,

- We typically .hi-pink[grow the trees deep and do not prune]
- For .hi-purple[regression], we .hi-purple[average] across the `\(B\)` trees' regions
- For .hi-purple[classification], we have more options—but often take .hi-purple[plurality]

--

.hi-pink[Individual] (unpruned) trees will be very .hi-pink[flexible] and .hi-pink[noisy], <br>but their .hi-purple[aggregate] will be quite .hi-purple[stable].

---

# Out-of-Bag Error

Bagging also offers a convenient method for evaluating performance.
--

For any bootstrapped sample, we omit ∼n/3 observations.

.hi-pink[Out-of-bag (OOB) error estimation] estimates the test error rate using observations .hi-medgrn[randomly omitted] from each bootstrapped sample.

--

For each observation `\(i\)`:

1. Find all samples `\(S_i\)` in which `\(i\)` was omitted from training.
1. Aggregate the `\(|S_i|\)` predictions `\(\color{#6A5ACD}{\mathop{\hat{f^b}}(x_i)}\)`, _e.g._, using their mean or mode
1. Calculate the error, _e.g._, `\(y_i - \mathop{\hat{f}_{\text{OOB}}}(x_i)\)`

---

# Out-of-Bag Error

When `\(B\)` is big enough, the OOB error rate will be very close to the LOOCV error rate.

--

.hi-medgrn[Q:] Why use OOB error rate?

--

.hi-medgrn[A:] When `\(B\)` and `\(n\)` are large, cross validation—with any number of folds—can become pretty computationally intensive.

---

# Random Forests

.hi-medgrn[Random forests] overcome a major shortcoming of bagging: .hi-pink[very correlated trees]

* Since trees consider all predictors at every split of every tree, the trees will often be highly related.

--

.hi-medgrn[Solution:] decorrelate the trees!

* Only consider a .hi-pink[random subset] of `\(\color{#e64173}{m\enspace (\approx\sqrt{p})}\)` .pink[predictors] when making each split (for each tree)

--

* This nudges trees away from always using the same variables, increasing variation across trees, and potentially reducing the variance of our estimates

--

* If our predictors are very correlated, we may want to shrink `\(m\)`

---

# Random Forests

Random forests thus introduce .hi-slate[two dimensions of random variation]

--

1. The .hi-medgrn[bootstrapped sample]

--

2. The `\(m\)` .hi-blue[randomly selected predictors] (for the split)

Everything else about random forests works just as it did with decision trees.

---

# Random Forests in R
-- Let's try and predict heart disease occurrence as a function of .pull-left.font80[ * `age` Patient age * `sex` Gender (`sex=1` for male) * `chest_pain` Type of chest pain * `rest_bp` Resting blood pressure * `chol` Serum cholesterol level * `fbs` Fasting blood sugar above 120mg/dl * `restecg` Resting ECG results * 0 normal, 1 have ST-T wave abnormality, 2 probable/definite left ventricular hypertrophy * `max_hr` Max heart rate ] .pull-right.font80[ * `ex_ang` Exercised induced angina (pain from reduced blood flow) * 1 yes * `oldpeak` ST depression induced by exercise relative to rest (observed in ECG) * `slope` Slope of the peak exercise ST segment (observed in ECG) * 0 upsloping, 1 flat, 2 downsloping * `ca` Number of major vessels colored by fluoroscopy (`ca`) * `thal` Thalium stress test result * 0 normal, 1 fixed defect, 2 reversible defect, 3 not described ] --- # Random Forests in R Let's first define our recipe and do some preprocessing. * .hi-medgrn[Recipe:] predict heart disease (Yes/No) as a function of everything else * .hi-medgrn[Cleaning:] imput missing values (median for numeric, mode for categorical) ```r # Impute missing values heart_recipe <- recipe(heart_disease ~ ., data = heart_df) %>% step_impute_median(all_predictors() & all_numeric()) %>% step_impute_mode(all_predictors() & all_nominal()) ``` -- Now we can use `prep()` and `juice()` to replace our data frame with the cleaned version: ```r heart_df_clean <- heart_recipe %>% prep() %>% juice() ``` --- # Random Forests in R You have several [options](http://topepo.github.io/caret/train-models-by-tag.html#Random_Forest) for training random forests with `tidymodels`..super[🎄] * _E.g._ `ranger`, `randomForest`, `spark`. `rand_forest()` accesses each of these packages via their .it[engines]. .footnote[ 🎄 And even more if you look outside of `tidymodels`. ] -- - The default engine is `"ranger"` ([`ranger`](https://cran.r-project.org/web/packages/ranger/index.html) package). 
-- - The argument `mtry` gives `\(m\)`, the # of predictors at each split. -- You've already seen the other hyperparameters for `ranger`: - `trees` the number of trees in your (random) forest - `min_n` min. # of observations --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( * mode = "classification", mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", * mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split - 100 trees in the forest ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, * trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split - 100 trees in the forest - At least 2 obs. 
to split ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, trees = 100, * min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split - 100 trees in the forest - At least 2 obs. to split - Choose the `ranger` engine ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( * engine = "ranger", splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split - 100 trees in the forest - At least 2 obs. to split - Choose the `ranger` engine - Set a [splitting rule](https://dials.tidymodels.org/reference/ranger_parameters.html) ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", * splitrule = "gini", importance = "impurity" ) ``` ] --- # Random Forests in R Training a random forest in R using .hi-slate[tidymodels] and .hi-slate[ranger]: .pull-left.font90[ <br> - Goal: Classification - Try 3 variables per split - 100 trees in the forest - At least 2 obs. 
to split - Choose the `ranger` engine - Set a [splitting rule](https://dials.tidymodels.org/reference/ranger_parameters.html) - Set the method for calculating variable importance ] .pull-right[ ```r # Define the random forest heart_rf <- rand_forest( mode = "classification", mtry = 3, trees = 100, min_n = 2 ) %>% set_engine( engine = "ranger", splitrule = "gini", * importance = "impurity" ) ``` ] --- # Random Forests in R We can then build our workflow and fit the random forest: --- # Interpreting Random Forests While ensemble methods like random forests tend to .hi[improve predictive performance], they also tend to .hi[reduce interpretability]. -- We can illustrate .hi-blue[variables' importance] by considering their splits' reductions in the model's .hi-blue[performance metric] (RSS, Gini, entropy, _etc._)..super[🌳] .footnote[ 🌳 This idea isn't exclusive to forests; it also works for a single tree. ] -- .note[Note] By default, many variable importance functions will scale importance. --- # Variable Importance Plotting the most important variables using `vip()` from .hi-slate[vip]: ```r # extract the fitted forest extr <- heart_rf_fit %>% extract_fit_parsnip() # plot variable importance extr %>% vip::vip() ``` <img src="data:image/png;base64,#ML-Trees_files/figure-html/unnamed-chunk-27-1.png" width="60%" style="display: block; margin: auto;" /> --- # Variable Importance Or extracting manually: .font90[ ```r var_imp <- data.frame(variable = names(extr$fit$variable.importance), importance = extr$fit$variable.importance %>% as.numeric() # as.numeric removes row names ) %>% arrange(desc(importance)) var_imp ```
| variable | importance |
|----------|-----------|
| chest_pain | 21.6 |
| ca | 18.4 |
| oldpeak | 16.4 |
| max_hr | 16 |
| thal | 15.1 |
| age | 12.4 |
| rest_bp | 11.6 |
| chol | 11 |
| slope | 7.42 |
| ex_ang | 7.31 |
| sex | 4.44 |
| rest_ecg | 2.75 |
| fbs | 1.4 |
]

---

# Variable Importance

Standardizing to 0-100 scale:

```r
var_imp <- var_imp %>%
  # standardize importance
  mutate(imp_std = importance - min(importance),
         imp_std = 100 * imp_std / max(imp_std))
```

---

# Variable Importance

Plotting standardized importance:

<img src="data:image/png;base64,#ML-Trees_files/figure-html/plot-var-importance-1.png" width="80%" style="display: block; margin: auto;" />

---

# Random Forest Error

Out-of-bag error is hidden pretty deep in the fitted model:

```r
rf_error <- heart_rf_fit$fit$fit$fit$prediction.error
```

Which means our forest classified 87% of cases correctly out-of-bag (accuracy is 1 minus the OOB prediction error)!

---

# Random Forest Predictions

Extracting the predictions confirms another .hi-medgrn[interpretability challenge] of random forests/ensemble methods: <br> since we .hi-medgrn[average over all tree predictions], our predicted classifications represent the .hi-medgrn[share of trees that assign the observation to class] `\(s\)`..super[3]

.footnote[3\. The result of this averaging is more interpretable for regression forests]

---

# Random Forest Predictions

Extracting the predictions and viewing them:

.font90[
```r
rf_fit <- heart_rf_fit %>% extract_fit_parsnip()
rf_pred <- rf_fit$fit$predictions %>% as.data.frame()
head(rf_pred)
```
| No | Yes |
|----|-----|
| 0.6 | 0.4 |
| 0.0303 | 0.97 |
| 0.025 | 0.975 |
| 0.526 | 0.474 |
| 1 | 0 |
| 0.935 | 0.0645 |
]

---

# Random Forest Predictions

As with the predicted probabilities from the Bayes classifier, we can assign each observation to the .hi-medgrn[most frequently predicted class]:

```r
rf_pred <- rf_pred %>%
  mutate(heart_disease_hat = case_when(
    Yes > No ~ "Yes", # assign Yes if more trees assign the obs to Yes
    TRUE ~ "No" # otherwise assign to No
  ))
```

--

Adding the predictions into the cleaned data frame:

```r
heart_df_clean <- mutate(heart_df_clean,
  heart_disease_hat = rf_pred$heart_disease_hat,
  correct = ifelse(heart_disease == heart_disease_hat, 1, 0))
```

---

# Random Forest Predictions

Seeing how we did at predicting each class:

.font90[
```r
group_by(heart_df_clean, heart_disease) %>%
  summarise(Correct = mean(correct) %>% round(4), Count = n())
```
| heart_disease | Correct | Count |
|---------------|---------|-------|
| No | 0.86 | 164 |
| Yes | 0.748 | 139 |
So we did relatively better at predicting the more frequent "No Heart Disease" cases..super[4]
]

.footnote[4\. Note that the unconditional average would get us a different value than the model's error rate. That's because the error rate is calculated at the individual tree level while the above error is a summary/average across all the trees' predictions]

---

# Random Forest Cross-Validation

Just like with our single tree, we can .hi-pink[tune] our random forest parameters via .hi-pink[cross-validation].

--

.hi-medgrn[1\.] Partition the data into training/test samples

```r
set.seed(12345)
split <- initial_split(data = heart_df_clean, prop = 0.5) # split 50/50
train_set <- training(split)
test_set <- testing(split)
```

---

# Random Forest Cross-Validation

.hi-medgrn[2\.] Setup the cross-validation with `vfold_cv()`

```r
# 3-fold cross-validation
heart_cv <- vfold_cv(train_set, v = 3)
```

---

# Random Forest Cross-Validation

.hi-medgrn[3\.] Set up the random forest for cross-validating parameters

```r
# Define the random forest
heart_rf_cv <- rand_forest(
  mode = "classification",
* mtry = tune(),
  trees = 100,
* min_n = tune()
) %>%
  set_engine(
    engine = "ranger",
    splitrule = "gini",
    importance = "impurity"
  )
```

---

# Random Forest Cross-Validation

.hi-medgrn[4\.] Build a grid of possible hyperparameter values

```r
# Define the grid of parameter values
rf_grid <- expand_grid(
  mtry = 1:13,
  min_n = 1:15
)
```

---

# Random Forest Cross-Validation

.hi-medgrn[5\.] Set the workflow

```r
rf_cv_wf <- workflow() %>%
  add_model(heart_rf_cv) %>% # new CV random forest setup
  add_recipe(heart_recipe) # same recipe as before
```

---

# Random Forest Cross-Validation

.hi-medgrn[6\.] Tune!
```r rf_tune <- rf_cv_wf %>% tune_grid( resamples = heart_cv, # the CV setup grid = rf_grid # the grid of hyperparameter values to try ) ``` --- # Random Forest Cross-Validation Showing the 5 best-performing models: ```r show_best(x = rf_tune, metric = "accuracy", n = 5) %>% select(mtry:min_n, mean, std_err) ```
| mtry | min_n | mean  | std_err |
| ---- | ----- | ----- | ------- |
| 2    | 1     | 0.847 | 0.0273  |
| 3    | 5     | 0.841 | 0.0352  |
| 4    | 1     | 0.841 | 0.0405  |
| 3    | 9     | 0.834 | 0.0183  |
| 3    | 6     | 0.834 | 0.0248  |
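-- We can then lock in the winning combination and refit. A minimal sketch using the tidymodels tooling (assuming the `rf_tune`, `rf_cv_wf`, and `split` objects defined above; `last_fit()` refits on the full training set and evaluates once on the held-out test set):

```r
# Pull the best hyperparameter combination by CV accuracy
best_rf <- select_best(rf_tune, metric = "accuracy")

# Plug it into the workflow, refit on the training set,
# and evaluate on the test set
final_rf_fit <- rf_cv_wf %>%
  finalize_workflow(best_rf) %>%
  last_fit(split)

collect_metrics(final_rf_fit)
```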
--- # Random Forests Random Forests offer a lot of improvement over single decision trees from a performance standpoint. -- However, they can often be harder to interpret. -- Perhaps more limiting for the types of questions we economists ask: it's not always a .hi-blue[conditional mean] that we're interested in. -- What if we want to harness the benefits of .hi-medgrn[forest-based ensemble methods] for identifying .hi-pink[any conditional object of interest?] --- class: inverse, middle name: causal # Machine Learning for Causal Treatment Effect Estimation --- # Machine Learning for Causal Treatment Effect Estimation The goal of .hi-medgrn[Random Forests] is to estimate a .hi-medgrn[conditional mean], `$$\mu(x_0) = E[Y~|~X = x_0 ]$$` -- But what if instead we wanted to estimate... .hi-blue[any] conditional function `\(\theta(x_0)\)`? * Conditional linear regression coefficients * Conditional causal treatment effects * Conditional local average treatment effects using instruments * Quantiles of the conditional distribution of `\(Y~|~X = x_0\)` * Conditional class probabilities -- Enter [Athey, Tibshirani, and Wager (2019)](https://doi.org/10.1214/18-AOS1709). --- # Generalized Random Forests .hi-medgrn[Generalized Random Forests] by [Athey, Tibshirani, and Wager (2019)](https://doi.org/10.1214/18-AOS1709) extends the ensemble methods of Random Forests to essentially any conditional function `\(\theta(x)\)` that can be identified by local moment conditions. -- They recast forests as an .hi-medgrn[adaptive locally weighted estimator] -- 1. Use the forest to calculate a weighted set of neighbors for each observation (sound familiar?) * Employ problem-specific splitting rules to maximize the likelihood of capturing heterogeneity in `\(\theta(x)\)` -- 1. Use those neighbors to estimate `\(\theta(x)\)` as the solution to a local moment equation --- # GRF Setup .hi-blue[More formally:] Suppose you have an IID sample with `\(N\)` observations.
-- You observe: -- * `\(O_i\)` an observable quantity that encodes information relevant to estimating `\(\theta(\cdot)\)` -- * For nonparametric regression, this is just our outcome `\(O_i = \{Y_i\}\)` -- * For treatment effect estimation, `\(O_i = \{Y_i, W_i\}\)` -- * Generally can contain more info -- * Auxiliary covariates `\(X_i\)` --- # GRF .hi-medgrn[Goal:] Estimate solutions to local estimating equations of the form `$$E[\psi_{\theta(x), ~\nu(x)}(O_i)~|~X_i = x] = 0$$` -- * `\(\theta(x)\)` the parameter we care about (e.g. conditional mean, treatment effect, regression slope) -- * `\(\nu(x)\)` an (optional) nuisance parameter For the conditional mean with no nuisance parameter, `\(\psi_{\mu(x)}(Y_i) = Y_i - \mu(x)\)`
-- Weights within a given tree, `\(\alpha_{bi}(x)\)`, are calculated as `$$\alpha_{bi}(x) = \frac{I[X_i \in L_b(x)]}{|L_b(x)|}$$` * `\(L_b(x)\)` the set of observations falling within the same leaf as `\(x\)` -- Overall weights for observation `\(i\)` are then the average across all trees: .font90[ `$$\alpha_i(x) = \frac{1}{B} \sum_{b=1}^B \alpha_{bi}(x)$$` ] --- # GRF Weights For example, we can reframe random forests and our estimate of the conditional mean function as `$$\hat\mu(x) = \sum_{i=1}^n \frac{1}{B} \sum_{b=1}^B Y_i \frac{I[X_i \in L_b(x)]}{|L_b(x)|}$$` -- Or as the weighted sum of single tree predictions `\(\hat\mu_b(\cdot)\)`: `$$\hat\mu(x) = \frac{1}{B} \sum_{b=1}^B \hat\mu_b(x)$$` -- Or more simply, as `$$\hat\mu(x) = \sum_{i=1}^n Y_i \alpha_i(x)$$` --- # GRF Weights Visually Let's see how these weights are developed with a simple example: a forest with three trees and two covariates. .pull-left[ * `\(x\)` the point of interest * lines our splits * darker dots the points within the same leaf ] .center.pull-right[ .hi-green[Tree 1] <img src = "images/splits1.png" width = "350" /img> ] --- # GRF Weights Visually Let's see how these weights are developed with a simple example: a forest with three trees and two covariates. .pull-left[ * `\(x\)` the point of interest * lines our splits * darker dots the points within the same leaf ] .center.pull-right[ .hi-purple[Tree 2] <img src = "images/splits2.png" width = "350" /img> ] --- # GRF Weights Visually Let's see how these weights are developed with a simple example: a forest with three trees and two covariates. .pull-left[ * `\(x\)` the point of interest * lines our splits * darker dots the points within the same leaf ] .center.pull-right[ .hi-pink[Tree 3] <img src = "images/splits3.png" width = "350" /img> ] --- # GRF Weights Visually Let's see how these weights are developed with a simple example: a forest with three trees and two covariates. 
.pull-left[ Then we average across all the individual tree-based weights to obtain the overall `\(\alpha_i(x)\)`. ] .center.pull-right[ .hi-purple[Forest Weights for] `\(x\)` <img src = "images/splits4.png" width = "350" /img> ] --- # Honesty One additional benefit of using GRF: growing .hi-medgrn[honest] forests -- .hi-medgrn[Honesty] aims to reduce bias in tree predictions by using .hi-medgrn[distinct subsamples] for 1\. Building the tree, and 2\. Making predictions -- .pull-left.center[ .hi-purple[Classic Random Forest] * Draws a single subsample and uses it both for choosing a tree's splits and for making predictions in that tree ] .pull-right.center[ .hi-medgrn[Honest Forests] * Draw a training sample * Use part of that sample to choose the tree's splits * Use the rest of the training sample to make predictions * Prune away any remaining empty leaves ] --- # GRF Functions Conveniently for us, the .hi-slate[grf] package contains tons of features and [extensive documentation](https://grf-labs.github.io/grf/index.html) to estimate a wide range of forests, including... .hi-blue[Causal Treatment Effects] | Approach | Forest Function | | --------------------- | --------------- | | Causal Treatment Effects | `causal_forest()` | | Causal Treatment Effects with right-censored outcomes |`causal_survival_forest()` | | Multi-arm/multi-outcome causal treatment effects | `multi_arm_causal_forest()` | --- # GRF Functions Conveniently for us, the .hi-slate[grf] package contains tons of features and [extensive documentation](https://grf-labs.github.io/grf/index.html) to estimate a wide range of forests, including...
.hi-medgrn[Moments of Conditional Distributions] | Approach | Forest Function | | --------------------- | --------------- | | Conditional Mean | `regression_forest()`| | Multi-outcome conditional means | `multi_regression_forest()`| | Right-censored Survival | `survival_forest()` | | Conditional Quantiles | `quantile_forest()` | | Conditional Class Probabilities | `probability_forest()`| --- # GRF Functions Conveniently for us, the .hi-slate[grf] package contains tons of features and [extensive documentation](https://grf-labs.github.io/grf/index.html) to estimate a wide range of forests, including... .hi-purple[Regression Outcomes] | Approach | Forest Function | | --------------------- | --------------- | | Conditional Linear Model | `lm_forest()`| | Local Average Treatment Effects (IV) | `instrumental_forest()`| --- # Causal Forests When `\(W_i \in \{0,1\}\)` and .hi-medgrn[unconfoundedness holds], we can estimate heterogeneous causal treatment effects with .hi-blue[causal forests]. -- .hi-medgrn[1\. Build Phase:] greedily choose splits to maximize the .hi-green[squared difference in subgroup treatment effects] `$$n_Ln_R(\hat{\tau}_{L} - \hat{\tau}_R)^2$$` * `\(\hat\tau\)`'s chosen through a centered residual-on-residual regression (Robinson 1988) --- # Causal Forests .hi-purple[2\. Estimate] `\(\tau(x)\)`, where `$$\tau(x) := lm~\left(Y_i - \hat{m}^{(-i)}(X_i) \sim W_i - \hat{e}^{(-i)}(X_i), \text{ weights} = \alpha_i(x)\right)$$` * `\(\left(Y_i - \hat{m}^{(-i)}(X_i), ~W_i - \hat{e}^{(-i)}(X_i)\right)\)`: orthogonalized outcome/treatment * Residuals after "regressing out" the main effect of `\(X_i\)` on `\(Y_i\)` (`\(W_i\)`) * `\(\hat{m}^{(-i)}(X_i)\)`, `\(\hat{e}^{(-i)}(X_i)\)`: out-of-bag predictions obtained from separate regression forests --- # Causal Forests Example Let's load in some data to see how we can run causal forests, using a sample from the .hi-medgrn[National Study of Learning Mindsets].
.center[ <img src = "images/NSLM.png" width = "450" /img> ] ```r nslm <- read.csv("data/NSLM.csv") %>% mutate(schoolid = factor(schoolid)) ``` --- # Causal Forests Example Taking a look at the data, we have the following variables: .pull-left.font90[ * `schoolid`: the ID of randomly selected US public high schools * `Y`: continuous measure of achievement * `Z`: treatment status * `S3`: students' self-reported expectations for future success * `C1`: student race (categorical) * `C2`: student gender (categorical) * `C3`: first-generation status (categorical) * `XC`: school urbanicity (categorical) ] .pull-right.font90[ * `X1`: school-level mean of students' fixed mindsets * `X2`: school achievement level (test scores + college prep) * `X3`: school racial minority composition (% black, latino, native american) * `X4`: school poverty (% in families below FPL) * `X5`: school size ] --- # Causal Forests Example We want to answer the following research questions: -- 1. Was a nudge-like mindset intervention designed to instill a "growth mindset" in students effective at improving achievement? -- 1. Do schools' prior achievement levels (`X2`) and pre-existing mindset norms (`X1`) affect the magnitude of this effect? -- Let's find out! --- # Causal Forests Example We *could* just run an (interacted) OLS regression... .font90[ ```r reg1 <- feols(Y ~ Z + S3 + C1 + C2 + C3 + XC + X1 + X2 + X3 + X4 + X5, data = nslm, vcov = "HC1") reg2 <- feols(Y ~ Z + Z:X2 + Z:X3 + S3 + C1 + C2 + C3 + XC + X1 + X2 + X3 + X4 + X5, data = nslm, vcov = "HC1") etable(reg1, reg2, keep = c("Z", "X2", "X1")) ```
|                 | reg1                | reg2                |
| --------------- | ------------------- | ------------------- |
| Dependent Var.: | Y                   | Y                   |
| Z               | 0.2549*** (0.0115)  | 0.2539*** (0.0115)  |
| X1              | -0.0954*** (0.0073) | -0.0954*** (0.0073) |
| X2              | -0.0163. (0.0087)   | -0.0263** (0.0098)  |
| Z x X2          |                     | 0.0313* (0.0141)    |
| Z x X3          |                     | 0.0152 (0.0134)     |
| S.E. type       | Heteroskedast.-rob. | Heteroskedast.-rob. |
| Observations    | 10,391              | 10,391              |
| R2              | 0.27795             | 0.27829             |
| Adj. R2         | 0.27718             | 0.27739             |
] --- # Causal Forests Example ...which gives us an estimate of the mean treatment effect and mean mediating effect of `X1/X2`. -- Instead, let's use `causal_forest()` to obtain heterogeneous treatment effects for each school. Let's first build some data objects we'll use as arguments. -- 1\. Convert `C1` and `XC` to dummies: * `C1` has 15 levels, so rather than code each one by hand let's use `model.matrix` * Argument: formula ```r # expand C1 categorical variable into a full set of dummies (no intercept) C1.exp = model.matrix(~ factor(nslm$C1) + 0) ``` --- # Causal Forests Example ...which gives us an estimate of the mean treatment effect and mean mediating effect of `X1/X2`. -- Instead, let's use `causal_forest()` to obtain heterogeneous treatment effects for each school. Let's first build some data objects we'll use as arguments. Repeat for `XC`: ```r XC.exp = model.matrix(~ factor(nslm$XC) + 0) ``` --- # Causal Forests Example Next, build the predictor matrix ```r # first select all covariates except for XC and C1 X <- select(nslm, -schoolid:-Y, -XC, -C1) # Add in the dummies for XC and C1 Xmat <- cbind(X, XC.exp, C1.exp) %>% as.matrix() ``` -- And get the outcome and treatment vectors: ```r Y <- nslm$Y # our outcome Z <- nslm$Z # our treatment indicator ``` --- # Causal Forests Example To perform the residual-on-residual regression step, our forest wants predictions for `\(\hat{Y}\)` and `\(\hat{W}\)`. We can either -- 1. Omit them from the forest setup * The forest will estimate them for us in a separate regression forest -- 1. Run the initial regression forests manually and supply the predictions -- Let's do the latter, but first an important consideration... --- # Causal Forests Example As with random forests, by default GRF methods assume .hi-medgrn[independent observations]. -- Here, we know that treatment assignment occurred at the .hi-blue[school level], so students' treatment is dependent on which school they attend.
-- The good news is, we can account for this .hi-blue[clustering] within our forest by drawing our random units at the school level. * `clusters`: the clustering variable (here our `schoolid` factor) * `equalize.cluster.weights = TRUE` ensures we draw the same fraction from each cluster --- # Causal Forests Example Running the initial regression forests for our `\(\hat{Y}\)` and `\(\hat{W}\)` predictions: ```r Y_forest <- regression_forest(X = Xmat, Y = Y, * clusters = nslm$schoolid, * equalize.cluster.weights = TRUE) Y_hat <- predict(Y_forest)$predictions W_forest <- regression_forest(X = Xmat, Y = Z, clusters = nslm$schoolid, equalize.cluster.weights = TRUE) W_hat <- predict(W_forest)$predictions ``` --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `X`: the covariate matrix ] .pull-right-lg[ ```r # Define the causal forest *cf <- causal_forest(X = Xmat, Y = Y, Y.hat = Y_hat, W = Z, W.hat = W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `X`: the covariate matrix - `Y`: the outcome vector (our measure of student achievement) - `W`: the treatment vector ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, * Y = Y, * W = Z, Y.hat = Y_hat, W.hat = W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `X`: the covariate matrix - `Y`: the outcome vector (our measure of student achievement) - `W`: the treatment vector - `Y.hat`: our prediction of Y as a function of X - `W.hat`: our prediction of W as a function of X ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, Y = Y, W = Z, * Y.hat = Y_hat, * W.hat
= W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `X`: the covariate matrix - `Y`: the outcome vector (our measure of student achievement) - `W`: the treatment vector - `Y.hat`: our prediction of Y as a function of X - `W.hat`: our prediction of W as a function of X - `clusters`: the cluster identifier vector - `equalize.cluster.weights`: whether to draw equal sample sizes from each cluster ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, Y = Y, W = Z, Y.hat = Y_hat, W.hat = W_hat, * clusters = nslm$schoolid, * equalize.cluster.weights = TRUE, num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `num.trees`: size of forest to grow (default is 2000) ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, Y = Y, W = Z, Y.hat = Y_hat, W.hat = W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, * num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `num.trees`: size of forest to grow (default is 2000) - `honesty`: whether to grow honestly (defaults to TRUE) - `honesty.fraction`: the subsample portion used for growing trees (defaults to 50%) ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, Y = Y, W = Z, Y.hat = Y_hat, W.hat = W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, num.trees = 200, * honesty = TRUE, * honesty.fraction = 0.5, tune.parameters = "all" ) ``` ] --- # Causal Forests Example Setting up the causal forest: .pull-left-sm.font90[ <br> - `num.trees`: size of forest to grow (default is 2000) - `honesty`:
whether to grow honestly (defaults to TRUE) - `honesty.fraction`: the subsample portion used for growing trees (defaults to 50%) - `tune.parameters`: which hyperparameters to tune, if any - `none, all, sample.fraction, mtry, min.node.size, honesty.fraction, honesty.prune.leaves, alpha, imbalance.penalty` ] .pull-right-lg[ ```r # Define the causal forest cf <- causal_forest(X = Xmat, Y = Y, W = Z, Y.hat = Y_hat, W.hat = W_hat, clusters = nslm$schoolid, equalize.cluster.weights = TRUE, num.trees = 200, honesty = TRUE, honesty.fraction = 0.5, * tune.parameters = "all" ) ``` ] --- # Table of Contents **Part 3: Tree-Based Methods** 1. [Decision Trees](#trees) 1. [Random Forests](#forests) 1. [Machine Learning for Causal Treatment Effect Estimation](#causal) 1. [Deep Learning (if time - not a tree method)](#deep)
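--- # Causal Forests Example With the forest fit, we can recover each student's estimated treatment effect and aggregate up. A sketch of the next step (assuming the `cf` object defined above; in .hi-slate[grf], `predict()` returns out-of-bag CATE predictions and `average_treatment_effect()` gives a doubly robust ATE estimate):

```r
# Out-of-bag CATE predictions (one per observation), with variance estimates
tau_hat <- predict(cf, estimate.variance = TRUE)
head(tau_hat$predictions)

# Doubly robust estimate of the overall average treatment effect
average_treatment_effect(cf, target.sample = "all")
```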